W2: Data Cleaning 1

Catch Up on Exercises (20 Minutes)

  • Work on exercises, if you haven’t had the chance
  • Ask questions in chat

Last Week Today (Clearest)

“The difference between double and single brackets for subsetting data in lists”

“vector dataframes and lists”

“I think the pacing is great and I am understanding the material so far!”

Last Week Today (Muddiest)

“I think so far I am following the material, although it did take a while to grasp the difference between single and double brackets”

Remember to Hit Record in Teams

This week

  • Review mutate()
  • Use case_when()
  • Use mutate(across())
  • A little about functions

Review of mutate()

Modifying and creating new columns

The mutate() function takes the dataframe of interest as its first argument. Each argument after that defines a new column, or overwrites an existing one, in terms of other columns.

I think of them like formulas in Excel.

penguins |>
  mutate(weight_flipper = body_mass_g * flipper_length_mm)

In Class: Mutate 1

02:00

Try calculating a new variable in penguins called body_mass_kg by dividing body_mass_g by 1000.

Confirm your conversion was correct by comparing body_mass_g and body_mass_kg.

penguins_new <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000) 

head(penguins_new)

Recoding Variables

Source: Allison Horst

Recoding Data

Say we want to convert our body_mass_g column into a categorical column.

With the criteria:

  • 4000 or smaller, we want to call “small”
  • Greater than 4000, we want to call “large”

We can write our criteria as follows:

body_mass_g <= 4000 ~ "small", # if body_mass_g is less 
                               # than or equal to 4000 return "small"
body_mass_g > 4000 ~ "large"   # if greater than 4000 return "large"

Plugging in our criteria

We can use these criteria directly in a case_when() within a mutate()

Here we are defining a new variable called size in our dataset:

penguins |> head(10) |>
  select(species, island, body_mass_g) |>
  mutate(size = case_when(
    body_mass_g <= 4000 ~ "small",
    body_mass_g > 4000 ~ "large"
  ))

Question

What happens to the NA values in the body_mass_g column? What are they transformed to?

More criteria

With the criteria:

  • 3500 or smaller, we want to call “small”
  • Between 3501 and 4000, we want to call “medium”
  • Greater than 4000, we want to call “large”

For medium, we need to chain two comparisons:

body_mass_g > 3500 & body_mass_g <= 4000 ~ "medium"

Here we’re combining two comparisons with the & operator.

Plug it in, Plug it in (sorry Glade)

penguins |>
  select(species, island, body_mass_g) |>
  mutate(size = case_when(
    body_mass_g <= 3500 ~ "small",
    body_mass_g > 3500 & body_mass_g <= 4000 ~ "medium",
    body_mass_g > 4000 ~ "large",
    TRUE ~ NA
  ))

What’s up with TRUE ~ NA?

This is what’s known as a catchall category. If a value meets none of the above criteria, we code it as NA.

This is super helpful when our criteria do not cover every possible value: whatever subset was missed still gets an explicit result.

If you don’t like that, you can pass the additional .default argument instead:

penguins |>
  select(species, island, body_mass_g) |>
  mutate(size = case_when(
    body_mass_g <= 3500 ~ "small",
    body_mass_g > 3500 & body_mass_g <= 4000 ~ "medium",
    body_mass_g > 4000 ~ "large",
    .default = NA
  ))
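
To see the catchall in action, here is a minimal sketch on a small made-up tibble. The name toy and its values are hypothetical stand-ins for penguins, and the sketch assumes dplyr 1.1.0 or later, where .default is available. The row with a missing body mass matches none of the comparisons, so it falls through to the default.

```r
library(dplyr)

# Hypothetical stand-in for penguins: three measured masses and one missing value
toy <- tibble(body_mass_g = c(3200, 3800, 4500, NA))

toy_sized <- toy |>
  mutate(size = case_when(
    body_mass_g <= 3500 ~ "small",
    body_mass_g > 3500 & body_mass_g <= 4000 ~ "medium",
    body_mass_g > 4000 ~ "large",
    .default = NA   # used for any row that matched no condition above
  ))

toy_sized$size
```

The last element of size is NA, because a comparison with NA is never TRUE.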

In Class: Recode 1

03:00

Try to implement the following criteria for flipper_length_mm, making a new variable flipper_size:

  • If flipper_length_mm is less than 190, code as “little”
  • If flipper_length_mm is greater than or equal to 190, code as “big”
  • NA otherwise
penguins |>
  mutate(flipper_size = case_when(
    flipper_length_mm < 190 ~ "little",
    flipper_length_mm >= 190 ~ "big",
    TRUE ~ NA
  ))

Recoding Data / Conditionals

case_when() works like a chain of nested conditionals: the conditions are evaluated in order, going down the list, and each value gets the result of the first condition it meets.

For example, for this code, we first evaluate:

 flipper_length_mm < 190

And then we evaluate for the remaining values:

 flipper_length_mm >= 190

And then we handle anything that doesn’t meet the above conditions. By the time we get to the last line, what is left is mostly the NAs:

TRUE ~ NA
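
Because the conditions are checked from top to bottom and the first match wins, their order matters. A small sketch on a made-up vector (flipper, the "huge" category, and the 220 cutoff are hypothetical and only serve the illustration):

```r
library(dplyr)

flipper <- c(180, 195, 230, NA)  # hypothetical flipper lengths in mm

# Specific condition first: 230 is caught by "> 220" before "> 190" is checked
specific_first <- case_when(
  flipper > 220 ~ "huge",
  flipper > 190 ~ "big",
  TRUE ~ "other"   # catches 180 and also the NA
)

# General condition first: "> 190" already claims 230, so "huge" is never used
general_first <- case_when(
  flipper > 190 ~ "big",
  flipper > 220 ~ "huge",
  TRUE ~ "other"
)

specific_first
general_first
```

Note that the catchall here returns "other" rather than NA, so the missing value is relabeled too.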

How to do it in Base-R:

It’s pretty ugly:

penguins$flipper_size = NA_character_   # start with every row missing
penguins[!is.na(penguins$flipper_length_mm) & penguins$flipper_length_mm < 190, "flipper_size"] = "little"
penguins[!is.na(penguins$flipper_length_mm) & penguins$flipper_length_mm >= 190, "flipper_size"] = "big"

penguins

Conditionals

case_when() in recoding data is closely tied to the concept of conditionals in programming: given certain conditions, you run a specific code chunk.

if (expression_is_TRUE) {
  # code goes here
}

if (expression_A_is_TRUE) {
  # code goes here
} else if (expression_B_is_TRUE) {
  # other code goes here
} else {
  # some other code goes here
}
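
As a runnable illustration of the pattern, here is a minimal sketch with a hypothetical helper, label_flipper(), that applies the same cutoff as the recoding exercise. Unlike case_when(), if() checks a single TRUE or FALSE value, so this version labels one measurement at a time.

```r
# Hypothetical helper that labels a single flipper length
label_flipper <- function(length_mm) {
  if (is.na(length_mm)) {
    NA_character_        # missing measurement stays missing
  } else if (length_mm < 190) {
    "little"
  } else {
    "big"
  }
}

label_flipper(181)
label_flipper(210)
label_flipper(NA)
```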

mutate(across())

mutate(across()): apply a function to multiple columns

Source: Allison Horst

Use mutate(across()) to do something to multiple columns

across() is a function introduced in dplyr 1.0 that lets you apply a function to a set of columns, chosen either by name or by a set of criteria. If there is one pattern that makes your life easier, it is across().

Round all numeric columns

The basic syntax:

mutate(
  across(c(bill_length_mm, bill_depth_mm), # Columns to mutate
         round)                            # the function we want to apply
)

penguins |>
  mutate(
    across(c(bill_length_mm, bill_depth_mm), round)
  )

This always trips me up

Notice that the function in the second argument has no ():

across(c(bill_length_mm, bill_depth_mm), round)

Instead of:

across(c(bill_length_mm, bill_depth_mm), round())

What kind of functions work here?

  • Any function that works on a vector
  • The first argument must be the vector
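  • We can supply extra arguments by wrapping the call in an anonymous function (putting them after the function name, as in round, digits = 3, is deprecated as of dplyr 1.1.0):
across(c(bill_length_mm, bill_depth_mm), \(x) round(x, digits = 3))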

In Class: across() columns

02:00

Try converting the species and island columns to uppercase using toupper()

penguins |>
  mutate(across(c(species,island), toupper))

Remember our tests?

  • is.numeric
  • is.double
  • is.integer
  • is.character

We can use them to select columns to mutate(across())!

We just need to wrap them in where():

across(where(is.double))

Let’s apply the round() function to columns that are double:

penguins |>
  mutate(across(where(is.double), round))

In Class: across() 2

02:00

Use mutate(across()) to change everything that is a factor in penguins to upper case using toupper.

penguins |>
  mutate(across(where(is.factor), toupper))

tidyselect

We passed in bare column names and used where() to select columns.

There is a whole set of functions that let us select columns:

Function                          What it selects
everything()                      all variables
starts_with("a")                  variables that start with "a"
ends_with("b")                    variables that end with "b"
contains("c")                     variables that contain "c"
any_of(c("island", "species"))    all variables named in the vector

You can also negate these with a !, such as:

!starts_with("a")   #selects all variables that do NOT start with "a"
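
A quick way to see the effect of the negation is to compare the selected column names. This is a minimal sketch on a hypothetical one-row tibble, toy, that borrows a few penguins column names:

```r
library(dplyr)

# Hypothetical tibble that borrows a few penguins column names
toy <- tibble(
  species = "Adelie",
  bill_length_mm = 39.1,
  bill_depth_mm = 18.7,
  flipper_length_mm = 181
)

# Columns that start with "bill"
bill_cols <- toy |> select(starts_with("bill")) |> names()

# Every column that does NOT start with "bill"
other_cols <- toy |> select(!starts_with("bill")) |> names()

bill_cols
other_cols
```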

Applying tidyselect

penguins |>
  mutate(across(
    starts_with("flipper"),
    as.character))

In Class: across() 3

02:00

Use mutate(across()) to apply round to the measurement columns selected by starts_with("bill")

penguins |>
  mutate(
    across(starts_with("bill"),
      round
  ))

Writing Functions (intro)

Using our own functions in mutate(across())

Let’s define a simple function called square:

square = function(x) {
  x * x
}

In Class: Functions 1

What do you expect when we apply it to a vector?
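
One way to check your expectation, as a minimal sketch (it repeats the definition of square so that it runs on its own). Because * works element by element in R, each element of the vector is squared.

```r
square <- function(x) {
  x * x
}

# Apply the function to a whole vector at once
squared <- square(c(1, 2, 3))
squared
#> [1] 1 4 9
```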

Why it’s good to use named arguments

If we use named arguments, we’re able to change the order of arguments when we call the function:
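
For instance, here is a sketch with a hypothetical divide() function (not part of the course code), chosen because swapping its arguments would change the answer:

```r
# Hypothetical function where argument order matters
divide <- function(numerator, denominator) {
  numerator / denominator
}

divide(10, 2)                            # positional: 10 / 2
divide(denominator = 2, numerator = 10)  # named and reordered: still 10 / 2
divide(2, 10)                            # positional and swapped: 2 / 10
```

With names, the call says which value is which, so the order no longer matters.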

Can we across() it?

Try it out!
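
If you want something to compare against, here is a minimal sketch. It uses a hypothetical two-row tibble, toy, with made-up values in place of penguins, and repeats the definition of square so that it runs on its own.

```r
library(dplyr)

square <- function(x) {
  x * x
}

# Hypothetical stand-in for two penguins columns
toy <- tibble(
  bill_length_mm = c(40, 50),
  bill_depth_mm = c(18, 20)
)

# Our own function is passed by name, without (), just like round
toy_squared <- toy |>
  mutate(across(c(bill_length_mm, bill_depth_mm), square))

toy_squared
```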

Interpreting Functions, carefully

We write functions for two main, often overlapping, reasons:

  1. To follow the DRY (Don’t Repeat Yourself) principle
  2. To create modular structure and abstraction

Even if you are not writing functions (yet!), knowing how they are constructed helps you use them.

How functions are created

addFunction = function(num1, num2) {
  result = num1 + num2 
  return(result)
}
addFunction(3, 4)
[1] 7

When the function is called, the values you pass in are assigned to the function’s arguments (here num1 and num2), which are used only within the function. This ensures modularity.

Is this function modular?

x = 3
y = 4
addFunction = function(num1, num2) {
    result = num1 + num2 
    return(result)
}
addFunction(10, -5)
[1] 5

Ways to call the function

addFunction = function(num1, num2) {
    result = num1 + num2 
    return(result)
}
addFunction(3, 4)
[1] 7
addFunction(num1 = 3, num2 = 4)
[1] 7
addFunction(num2 = 4, num1 = 3)
[1] 7

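But with unnamed arguments, swapping the order changes which value goes to which argument. For many functions this gives a different result; here it is the same only because addition does not depend on order: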

addFunction(4, 3)
[1] 7

Interpreting functions

?mean

Arithmetic Mean

Description:

     Generic function for the (trimmed) arithmetic mean.

Usage:

     mean(x, ...)
     
     ## Default S3 method:
     mean(x, trim = 0, na.rm = FALSE, ...)
     
Arguments:

       x: An R object.  Currently there are methods for numeric/logical
          vectors and date, date-time and time interval objects.
          Complex vectors are allowed for ‘trim = 0’, only.

    trim: the fraction (0 to 0.5) of observations to be trimmed from
          each end of ‘x’ before the mean is computed.  Values of trim
          outside that range are taken as the nearest endpoint.

   na.rm: a logical evaluating to ‘TRUE’ or ‘FALSE’ indicating whether
          ‘NA’ values should be stripped before the computation
          proceeds.

     ...: further arguments passed to or from other methods.

Interpreting functions

Notice that the arguments trim = 0 and na.rm = FALSE have default values. This means that these arguments are optional: you only need to provide them when you want something other than the default. With this understanding, you can use mean() in a new way:

numbers = c(1, 2, NA, 4)
mean(x = numbers, na.rm = TRUE)
[1] 2.333333

The use of ... (dot-dot-dot): This is a special argument that allows a function to take any number of additional arguments. This isn’t very useful for the mean() function, but it makes sense for functions such as select(), filter(), and mutate(). For instance, in select(), once you provide your dataframe for the argument .data, you can pile on as many columns to select as you like in the remaining arguments.
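
A minimal sketch of how ... behaves, using two hypothetical functions (count_args() and mean_of() are made up for this illustration): the first shows that ... accepts any number of arguments, the second shows ... being handed on to another function.

```r
# Hypothetical function: counts however many arguments it receives
count_args <- function(...) {
  length(list(...))
}

count_args("species", "island", "body_mass_g")

# Hypothetical wrapper: anything after x travels through ... to mean()
mean_of <- function(x, ...) {
  mean(x, ...)
}

numbers <- c(1, 2, NA, 4)
mean_of(numbers)                # NA, because na.rm keeps its default of FALSE
mean_of(numbers, na.rm = TRUE)  # na.rm is passed along to mean()
```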

TL;DR

  • Review mutate()
  • Use case_when()
  • Use mutate(across())
  • A little about functions

That’s all!

Maybe see you Friday 10 - 11 am PT to practice together!